GPT-OSS 120b

mentions: 1, type: Person

// recent coverage: 1 mention

2026-06-24 18:58, lesswrong.com, ai-safety

Reward Hacking Without Egregious Misalignment in an RL-Only Setting

Researchers trained Kimi K2.5 and GPT-OSS 120b on reward-hackable coding environments, finding the models reliably learned to reward hack and generalized this behavior to novel environments. Unlike pr…

// co-occurs with top 7 entities

Kimi K2.5 (1), MATS (1), Ryan Greenblatt (1), Aghyad Deeb (1), Anders Woodruff (1), Monte MacDiarmid (1), Evan Hubinger (1)